Semi-automated XML Tagging of Public Text Archives: A Case Study
Abstract
Public archives contain large and continuously growing volumes of electronically available text documents. In many countries, public authorities are required by law to publish certain data to satisfy the information needs of the general public. In contrast to plain text documents, semantically tagged XML documents along with appropriate query languages largely facilitate searching and browsing in public archives for interested citizens. However, transforming textual legacy data into semantically annotated XML documents should be automated to minimize costly human effort. In this paper, we present the DIAsDEM framework for semi-automated semantic tagging of domain-specific text documents in a case study. Our framework includes a complex knowledge discovery process that groups structural text units (e.g., sentences) based on similarity of their content, derives semantic labels for qualitatively acceptable clusters, semantically tags text units and derives a preliminary unstructured XML DTD for the archive. We apply this framework to collections of publicly available Commercial Register archives and finally evaluate the quality of our approach.
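The process outlined in the abstract (group text units by content similarity, label acceptable clusters, tag the units, derive a DTD) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the DIAsDEM implementation: the stopword list, the Jaccard similarity with a greedy single-pass clustering, the "at least two members" acceptance criterion, the label construction from frequent terms, and the root element name `Document` are all hypothetical choices.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "of", "is", "in", "to", "and", "has", "been", "as"}


def tokens(sentence):
    """Lowercased alphabetic words of a sentence, without stopwords."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(sentences, threshold=0.3):
    """Greedy clustering: join the most similar cluster representative
    (its first member) if similarity reaches the threshold, else start
    a new cluster. Returns lists of sentence indices."""
    clusters = []
    for i, sentence in enumerate(sentences):
        current = tokens(sentence)
        best, best_sim = None, threshold
        for members in clusters:
            sim = jaccard(current, tokens(sentences[members[0]]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def label(sentences, members, n_terms=2):
    """Build a tag name from the most frequent terms of a cluster
    (ties broken alphabetically so the result is deterministic)."""
    counts = Counter(w for i in members for w in tokens(sentences[i]))
    top = sorted(counts, key=lambda w: (-counts[w], w))[:n_terms]
    return "".join(w.capitalize() for w in top)


def flat_dtd(labels, root="Document"):
    """A flat DTD: the root mixes plain text with the derived tags."""
    alternatives = " | ".join(["#PCDATA"] + labels)
    lines = [f"<!ELEMENT {root} ({alternatives})*>"]
    lines += [f"<!ELEMENT {name} (#PCDATA)>" for name in labels]
    return "\n".join(lines)


def tag_document(sentences, min_cluster_size=2, threshold=0.3):
    """Tag sentences in sufficiently large clusters; leave the rest as text."""
    tagged = [escape(s) for s in sentences]
    labels = []
    for members in cluster(sentences, threshold):
        if len(members) < min_cluster_size:
            continue  # toy stand-in for a cluster quality check
        name = label(sentences, members)
        if name not in labels:
            labels.append(name)
        for i in members:
            tagged[i] = f"<{name}>{escape(sentences[i])}</{name}>"
    return tagged, flat_dtd(labels)


if __name__ == "__main__":
    sample = [
        "The share capital is 50000 EUR.",
        "The share capital is 25000 EUR.",
        "Peter Smith has been appointed as managing director.",
        "Anna Jones has been appointed as managing director.",
        "The company has been dissolved.",
    ]
    tagged_units, dtd = tag_document(sample)
    print("\n".join(tagged_units))
    print(dtd)
```

In this sketch, a sentence that does not fall into an accepted cluster stays untagged, which is why the derived DTD allows plain text next to the tagged elements. A real system would likely replace the frequency-based labels with labels reviewed by a domain expert, which is where the "semi-automated" aspect would enter.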
Related Resources
Employing Text Mining for Semantic Tagging in DIAsDEM
Both public and private organizations have been accumulating large volumes of electronically available text documents for the past years. However, to turn text archives into profitable sources of knowledge, they should be transformed into an integrated and efficiently queryable information system. To attain this objective, the project DIAsDEM employs data mining techniques to derive a semantic ...
Extraction of Semantic XML DTDs from Texts Using Data Mining Techniques
Although composed of unstructured texts, documents contained in textual archives such as public announcements, patient records and annual reports to shareholders often share an inherent though undocumented structure. In order to facilitate efficient, structure-based search in archives and to enable information integration of text collections with related data sources, this inherent structure sh...
The DIAsDEM Framework for Converting Domain-Specific Texts into XML Documents with Data Mining Techniques
Modern organizations are accumulating huge volumes of textual documents. To turn archives into valuable knowledge sources, textual content must become explicit and queryable. Semantic tagging with markup languages such as XML satisfies both requirements. We thus introduce the DIAsDEM framework for extracting semantics from structural text units (e.g., sentences), assigning XML tags to them and ...
Automated Ontology Creation using XML Schema Elements
Ontologies are commonly used to represent formal semantics in a computer system, usually capturing them in the form of concepts, relationships and axioms. Axioms convey asserted knowledge and support inferring new knowledge through logical reasoning. For complex systems, the process of creating ontologies manually can be tedious and error-prone. Many automated methods of knowledge discovery are...